Hybrid-captioning system

ABSTRACT

A hybrid-captioning system for editing captions for spoken utterances within video includes an editor-type caption-editing subsystem, a line-based caption-editing subsystem, and a mechanism. The editor-type subsystem is that in which captions are edited for spoken utterances within the video on a groups-of-lines basis, without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances. The line-based subsystem is that in which captions are edited for spoken utterances within the video on a line-by-line basis, with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances. For each section of spoken utterances within the video, the mechanism is to select the editor-type or the line-based subsystem to provide captions for the section of spoken utterances in accordance with a predetermined criterion.

FIELD OF THE INVENTION

The present patent application relates generally to generating captions for spoken utterances within video, and more particularly to a hybrid-captioning system that generates such captions and employs both editor-type caption-editing and line-based caption-editing.

BACKGROUND OF THE INVENTION

Just as a caption in a book is the text under a picture, captions on video are text located somewhere on the picture. Closed captions are captions that are hidden in the video signal, invisible without a special decoder. The place they are hidden is called line 21 of the vertical blanking interval (VBI). Open captions are captions that have been decoded, so they become an integral part of the television picture, like subtitles in a movie. In other words, open captions cannot be turned off. The term “open captions” is also used to refer to subtitles created with a character generator.

Within the prior art, captions are commonly generated by voice recognition, manual human entry, or a combination of these techniques. Once generated by either approach, the captions have to be edited. In particular, the captions may have to be proofread for correctness, and properly and appropriately keyed to the video itself if this was not already accomplished by the caption-generation process. For instance, a given caption may have a timestamp, or temporal position, in relation to the video that indicates when the caption is to be displayed on the video. Furthermore, a caption may have a particular location at which it is to be displayed. For example, if two people on the video are speaking with one another, captions corresponding to spoken utterances of the left-most person may be placed on the left part of the video, and captions corresponding to spoken utterances of the right-most person may be placed on the right part of the video.

Within the prior art, there are three general types of conventional caption-editing systems. First, there is an editor-type caption-editing system, in which captions are edited for spoken utterances within video on a groups-of-lines basis, without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances. Such a caption-editing system may even include multiple-line editing capabilities within computer programs like word processors. In this type of system, there is no timestamping of the captions to the video, since the captions are generated for the video, or sections of the video, as a whole, without regard to temporal positioning. This type of system is also commonly referred to as “summary writing” or “listening dictation.” This type of system is useful where there are many errors in the captions themselves, since editing can be accomplished without regard to the different lines of the captions temporally corresponding to different parts of the video. However, it does require temporal positioning—i.e., timestamping—to later be added, which is undesirable.

Second, there is a line-based caption-editing system, in which captions are generated for spoken utterances within video on a line-by-line basis, with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances. Line-based caption-editing systems thus operate in relation to timestamps of the captions in relation to the video, on a caption line-by-caption line basis. This type of system is very effective for captions that are generated without errors, especially since temporal positioning—i.e., timestamping—is accomplished as part of the captioning process. However, where there are many errors within the captions, correction can become difficult, since the temporal positioning of the lines may become incorrect as a result of modification of the lines themselves. For instance, lines may be deleted, added, or merged in the process of editing, which can render the previous temporal positioning—i.e., timestamping—incorrect, which is undesirable as well.

A third type of caption-editing system is a respeaking caption-editing system. In respeaking, a specialist with a proven high voice-recognition rate respeaks the voices of various speakers on video, in order to convert them into voices with a higher voice-recognition rate. This approach is disadvantageous, however, because it is very labor intensive, and requires the utilization of highly skilled labor, in that only people who have proven high voice-recognition rates should respeak the voices of the speakers on the video. Thus, of the three types of caption-editing systems within the prior art, the editor-type system is useful where voice recognition results in many errors, the line-based system is useful where voice recognition results in few errors, and the respeaking system is relatively expensive.

In a given video, however, there may be sections in which voice recognition achieves a high degree of accuracy on the spoken utterances in question, and there may be other sections in which voice recognition does not achieve a high degree of accuracy on the spoken utterances in question. An editor-type caption-editing system therefore achieves good results for the latter sections but not for the former sections. By comparison, a line-based caption-editing system achieves good results for the former sections but not for the latter sections. Therefore, there is a need for achieving good caption results for all sections of video, regardless of whether the voice recognition yields accurate results or not. For this and other reasons, there is a need for the present invention.

SUMMARY OF THE INVENTION

The present invention relates to a hybrid-captioning system for editing captions for spoken utterances within video. The system in one embodiment includes an editor-type caption-editing subsystem, a line-based caption-editing subsystem, and a mechanism. The editor-type subsystem is that in which captions are edited for spoken utterances within the video on a groups-of-lines basis, without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances. The line-based subsystem is that in which captions are edited for spoken utterances within the video on a line-by-line basis, with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances. For each section of spoken utterances within the video, the mechanism is to select the editor-type subsystem or the line-based subsystem to provide captions for the section of spoken utterances in accordance with a predetermined criterion.

For instance, this criterion may be the certainty level (i.e., the accuracy) of the voice recognition that has been performed as to a given section of the spoken utterances within the video to perform the initial generation of the captions for that section. Where the certainty level is greater than a predetermined threshold, the mechanism selects the line-based caption-editing subsystem to ultimately provide the captions for this section of spoken utterances. However, where the certainty level is not greater than the predetermined threshold, the mechanism instead selects the editor-type caption-editing subsystem to ultimately provide the captions for this section of spoken utterances.

A method of an embodiment of the invention, in relation to video for which captions are to be generated, receives user input as to a current section of the video for which captions have been initially generated. The user input is received with an editor-type caption-editing subsystem. Where the user input corresponds to termination of the editor-type caption-editing subsystem—i.e., where the user has terminated editing of these captions within this subsystem—the following is accomplished. First, the captions are transmitted to a general-matching subsystem. The general-matching subsystem transmits the captions to a line-based caption-editing subsystem. If the user input does not correspond to termination of the editor-type caption-editing subsystem, however, then the method transmits the captions to a particular-matching subsystem (i.e., a different matching subsystem), which transmits the captions back to the editor-type subsystem.

An article of manufacture of an embodiment of the invention includes a tangible computer-readable data storage medium, and means in the medium. The means may be a computer program, for instance. The means is for selecting an editor-type caption-editing subsystem or a line-based caption-editing subsystem to provide captions for each of a number of sections of spoken utterances of video, in accordance with a predetermined criterion, such as that which has been described.

Embodiments of the invention provide advantages over the prior art. Within a given video, there may be sections of spoken utterances for which caption editing is best achieved via editor-type caption editing, and other sections of spoken utterances for which caption editing is best achieved via line-based caption editing. Accordingly, embodiments of the invention provide a hybrid-captioning system in which both editor-type caption editing and line-based caption editing are able to be achieved, depending on the section of spoken utterances of the video in question. By comparison, the prior art always forces a user to choose either line-based caption editing or editor-type caption editing, without letting the user use the former type of editing on captions for some sections of video, and the latter type of editing on captions for other sections of video.

For instance, a section of spoken utterances within the video that has a high certainty level of voice recognition may be edited within a line-based caption-editing subsystem of the inventive hybrid-captioning system, since line-based caption editing is most appropriate for captions having such high degrees of voice-recognition accuracy or certainty. As another example, another section of spoken utterances within the video that has a low certainty level of voice recognition may be edited within an editor-type caption-editing subsystem of the inventive hybrid-captioning system, since editor-type caption editing is most appropriate for captions having such low degrees of voice-recognition accuracy or certainty. Still other advantages, aspects, and embodiments of the invention will become apparent by reading the detailed description that follows, and by referring to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings referenced herein form a part of the specification. Features shown in the drawings are meant as illustrative of only some embodiments of the invention, and not of all embodiments of the invention, unless otherwise explicitly indicated, and implications to the contrary are otherwise not to be made.

FIG. 1 is a diagram of a hybrid-captioning system, according to a general and preferred embodiment of the invention, and is suggested for printing on the first page of the patent.

FIG. 2 is a diagram of a hybrid-captioning system, according to a more detailed embodiment of the invention.

FIG. 3 is a flowchart of a method for hybrid captioning to edit, including to generate, captions for spoken utterances within video, according to an embodiment of the invention.

FIG. 4 is a flowchart of a method for caption-video matching prior to and/or in accordance with line-based caption editing, according to an embodiment of the invention.

FIG. 5 is a flowchart of a method for caption-video matching prior to and/or in accordance with editor-type caption editing, according to an embodiment of the invention.

FIG. 6 is a flowchart of a method for general timestamp matching between captions and spoken utterances within video, according to an embodiment of the invention.

DETAILED DESCRIPTION OF THE DRAWINGS

In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific exemplary embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized, and logical, mechanical, and other changes may be made, without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

FIG. 1 shows a hybrid-captioning system 100, according to an embodiment of the invention. The system 100 is depicted in FIG. 1 as including an editor-type caption-editing subsystem 102, a line-based caption-editing subsystem 104, and a selection mechanism 106. Both the subsystem 102 and the subsystem 104, as well as the mechanism 106, may be implemented in software, hardware, or a combination of software and hardware. As can be appreciated by those of ordinary skill within the art, the system 100 may include other components, in addition to and/or in lieu of those depicted in FIG. 1.

The editor-type caption-editing subsystem 102 may in one embodiment be implemented as is conventional, and in another embodiment may be modified to provide additional functionality as is described later in the detailed description. In general, the editor-type subsystem 102 provides for the editing, including the generation, of captions for spoken utterances within video 108, which includes moving pictures and corresponding sound. This editing is provided for on a groups-of-caption-lines basis, without respect or regard to particular lines of the corresponding captions 112, and without respect to temporal positioning of the captions 112 in relation to the spoken utterances within the video 108, as has been described in more detail in the background section.

The line-based caption-editing subsystem 104 may in one embodiment be implemented as is conventional, and in another embodiment may be modified to provide additional functionality as is described later in the detailed description. In general, the line-based subsystem 104 provides for the editing, including the generation, of captions for spoken utterances within the video 108, as with the subsystem 102. This editing, however, is provided for on a line-by-line basis, with respect to particular lines of the captions 112 and with respect to temporal positioning, or timestamping, of these captions 112 in relation to the spoken utterances within the video 108, as has been described in more detail in the background section.

The video 108 itself can be considered as having a number of sections 110A, 110B, . . . , 110N, collectively referred to as the sections 110. The captions 112 themselves can be considered as having a number of corresponding groupings of lines 114A, 114B, . . . , 114N, collectively referred to as the lines 114. Thus, for each of the sections 110 of the video 108, a corresponding one or more of the lines 114 is initially generated as the captions for the spoken utterances within that section of the video 108. In one embodiment, voice recognition, or a user manually listening to the video 108, is employed to generate the lines 114 corresponding to the sections 110.

Thereafter, the mechanism 106 determines which of the subsystems 102 and 104 is to achieve ultimate editing, and thus ultimate generation, of the lines of the captions 112 corresponding to a given section of the video 108, based on or in accordance with a predetermined criterion. For instance, in one embodiment, the voice recognition of one or more portions of a given section of the video 108 is sampled or tested to determine the certainty or accuracy level of that voice recognition. If the certainty or accuracy level of the voice-recognition results is relatively high (i.e., above a threshold), then the mechanism 106 selects the line-based subsystem 104 to provide subsequent editing and generation of the corresponding captions. However, if the certainty or accuracy level of the voice-recognition results is relatively low (i.e., not above a threshold), then the mechanism 106 selects the editor-type subsystem 102 to provide subsequent editing and generation of the corresponding captions. This selection is achieved for each of the sections 110 of the video 108 for which corresponding ones of the lines 114 of the captions 112 have been initially generated.
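
By way of non-limiting illustration only, the selection performed by the mechanism 106 can be sketched in Python as follows; the names Subsystem and select_subsystem, and the 0.75 threshold, are assumptions of the sketch rather than elements of the specification.

```python
from enum import Enum

class Subsystem(Enum):
    EDITOR_TYPE = "editor-type"   # groups-of-lines editing, no timestamps
    LINE_BASED = "line-based"     # line-by-line editing with timestamps

def select_subsystem(certainty: float, threshold: float = 0.75) -> Subsystem:
    """Select the editing subsystem for one section of the video,
    based on the certainty level of its voice-recognition results."""
    if certainty > threshold:
        return Subsystem.LINE_BASED   # accurate results: line-based editing
    return Subsystem.EDITOR_TYPE      # inaccurate results: editor-type editing

# Example: per-section certainty levels sampled from the recognizer.
for section_id, certainty in [("110A", 0.91), ("110B", 0.42)]:
    print(section_id, select_subsystem(certainty).value)
```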

Thus, the mechanism 106 allows for optimal editing and generation of the captions 112 for the video 108, even where some sections of the video 108 are best handled in relation to the editor-type subsystem 102, and other sections of the video 108 are best handled in relation to the line-based subsystem 104. It is noted that the video 108 may be real-time video or recorded video. Furthermore, the captions 112 may be open captions or closed captions, as have been defined in the background section. Finally, it is noted that in one embodiment, the captions 112 are generated independent of the input path by which the video 108 is generated and by which voice recognition on the video 108 is achieved. That is, the hybrid-captioning system 100 is independent of any particular type of microphone, for instance, as well as of any particular requirements as to the file locations, and thus file paths, of the video 108 and the captions 112, as can be appreciated by those of ordinary skill within the art.

FIG. 2 shows the hybrid-captioning system 100 in more detail, according to an embodiment of the invention. The system 100 of the embodiment of FIG. 2 is thus more detailed than, but consistent with, the system 100 of the embodiment of FIG. 1. The selection mechanism 106 is not depicted in FIG. 2; rather, the functionality that the selection mechanism 106 performs—in parts 212 and 214—is depicted instead.

The speech utterances of the video 108 are input into a voice-recognition mechanism 202, which may be implemented in hardware, software, or a combination of hardware and software. The voice-recognition mechanism 202 recognizes voice within these speech utterances, either with or without human intervention, and thus generates an initial version of the captions 112 for the video 108. These voice-recognition results are stored in a storage device 204. The voice-recognition mechanism 202 may be implemented in one embodiment as is conventional.

Furthermore, any initial timestamping of which voice-recognition results (i.e., which of the captions 112) correspond to which parts of the video 108 is stored in the storage device 204 and/or the storage device 206, as are the individual phonemes of the speech utterances of the video 108 on the basis of which voice recognition was achieved. A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning, as the m of mat and the b of bat in English. Finally, the certainty or accuracy level of the voice-recognition results, on a section-by-section basis of the sections 110 of the video 108, is stored in the storage device 204 and/or the storage device 206. This certainty or accuracy level may also be yielded with or without human intervention, as can be appreciated by those of ordinary skill within the art.

The voice-recognition mechanism 202 then passes control to the hybrid-captioning system 100. For each section of the video 108, the selection mechanism 106 acquires the certainty level of the resulting voice-recognition results (i.e., the initial version of the captions 112 for this section), in part 212. Where the certainty level is greater than a predetermined threshold, such as 75% out of 100%, the mechanism 106 provides for ultimate generation and editing of the captions 112 in question by the line-based caption-editing subsystem 104, as indicated in part 214. As also indicated in part 214, where the certainty level is not greater than this threshold, the mechanism 106 provides for ultimate generation and editing of the captions 112 in question by the editor-type subsystem 102.

It is noted that the editor-type subsystem 102 is able to interact with a storage device 206 of the hybrid-captioning system 100, as well as with a general-matching subsystem 208 and a particular-matching subsystem 210 of the hybrid-captioning system 100. The line-based subsystem 104 is also able to interact with the storage device 206 and the subsystems 208 and 210. However, in the embodiment depicted specifically in FIG. 2, this latter interaction is not direct, but rather is indirectly accomplished through the editor-type subsystem 102. In another embodiment, though, the interaction between the line-based subsystem 104 and the storage device 206 and the subsystems 208 and 210 may instead be direct. It is noted that the ultimate output by the editor-type subsystem 102 and the line-based subsystem 104 is the captions 112.

The storage device 206 stores various information, including user inputs to be provided to the subsystems 102 and 104; character strings stored by the subsystems 208 and 210 and provided to the subsystems 102 and 104, as will be described; and timestamps, which may also be stored by the subsystems 208 and 210 and provided to the subsystems 102 and 104, as will also be described. The subsystems 208 and 210 may each be implemented in hardware, software, or a combination of hardware and software. The general-matching subsystem 208 is specifically that which is used in relation to the line-based subsystem 104, and the particular-matching subsystem 210 is specifically that which is used in relation to the editor-type subsystem 102.

The general-matching subsystem 208 is to match spoken utterances of a given section of the video 108 to the captions 112 that have been generated (i.e., the voice-recognition results) for this section, as is described in more detail later in the detailed description. A section of the video 108 in this respect corresponds to one or more individually demarcated lines of the lines 114 of the captions 112. That is, since the subsystem 208 is used for the line-based subsystem 104, which is used where voice-recognition results are relatively high, the individual lines of the captions 112 corresponding to this section of the video 108 thus will have been accurately demarcated.

By comparison, the particular-matching subsystem 210 is to match spoken utterances of a given section of the video 108 to the captions 112 that have been generated (i.e., the voice-recognition results) for this section, as is also described in more detail later in the detailed description. However, a section of the video 108 in this respect corresponds to a group of lines of the lines 114 of the captions 112 that are not demarcated. That is, since the subsystem 210 is used for the editor-type subsystem 102, which is used where voice-recognition results are relatively low, the individual lines of the captions 112 corresponding to this section of the video 108 thus will not have been demarcated at all.

For example, the section 110A of the video 108 corresponds to the lines 114A of the captions 112. Now, where the voice-recognition results for this section 110A are relatively high (i.e., relatively accurate, above a threshold, and so on), the lines 114A will include one or more lines that are individually demarcated in relation to one another. For instance, there may be three lines, which are individually demarcated as different lines. Such individually demarcated lines are most suitable for line-based caption editing, as is performed in the line-based caption-editing subsystem 104, and as has been described in the background section.

However, where the voice-recognition results for this section 110A are relatively low (i.e., relatively inaccurate, below a threshold, and so on), the lines 114A may still include one or more lines, but they are not individually demarcated in relation to one another; rather, they are considered a single grouping of lines. For instance, there may be three separate lines, but for purposes of captioning, these three lines are just considered part of the same grouping. Such a grouping of lines, without individual line demarcation, is most suitable for editor-type caption editing, as is performed in the editor-type caption-editing subsystem 102, and as has been described in the background section.
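
The distinction between demarcated and undemarcated caption lines can be pictured with a minimal data model, sketched below; the class and field names (CaptionSection, grouped_text, and so on) are illustrative assumptions, not terminology of the specification.

```python
from dataclasses import dataclass, field

@dataclass
class DemarcatedLine:
    """One individually demarcated caption line, with its own timestamp."""
    text: str
    start_seconds: float

@dataclass
class CaptionSection:
    """Captions for one section of the video (e.g., section 110A).

    When recognition certainty is high, `lines` holds individually
    demarcated lines suitable for line-based editing; otherwise the
    section is a single undemarcated group of text in `grouped_text`.
    """
    section_id: str
    lines: list = field(default_factory=list)   # demarcated case
    grouped_text: str = ""                      # undemarcated case

high = CaptionSection("110A", lines=[DemarcatedLine("THE CAT JUMPED", 12.0),
                                     DemarcatedLine("OVER THE BAG.", 13.4)])
low = CaptionSection("110B", grouped_text="the cat jumped over the bag")
```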

FIG. 3 shows a method 300 for performing hybrid captioning, according to an embodiment of the invention. The method 300 is performed in relation to the various components of the system 100, such as has been described in relation to FIG. 1 and/or FIG. 2. However, the method 300 provides the components of the system 100 with slightly different functionality as compared to that which has been described above. In particular, as will become apparent, the editor-type subsystem 102 has initial and primary processing responsibility, and hands over processing to the other subsystems 104, 208, and 210 as needed. It is also noted that the various components of the system 100 can employ the information stored in the storage devices 204 and 206 of FIG. 2 as needed and as necessary; for brevity, access to these devices 204 and 206 is not described in relation to the method 300 (or the other methods subsequently described).

First, user input is received by the editor-type subsystem 102 (302). The user input is in relation to a current section of the video 108 for which captions have been initially generated, such as by the voice-recognition mechanism 202 of FIG. 2, for instance. The user may specify via this input that the editor-type subsystem 102 is not to be used to edit or finalize (i.e., ultimately generate) these captions (304), in which case the general-matching subsystem 208 is entered (306), and thereafter the line-based subsystem 104 is entered (308). That is, it can be said that the captions are transmitted to the subsystem 208, and then to the subsystem 104. The functionality performed by the subsystem 208 is as has been described, and as is described in more detail later in the detailed description.

If the user has not entered input specifying that the editor-type subsystem is not to be used (304), then the method 300 proceeds to record one or more keys, timestamps, and/or characters (312). That is, because the editor-type subsystem 102 is providing the finalization of the captions, the user has to manually enter the keys by which the captions are to be divided over the various lines, since the captions for this section of the video 108 are themselves a group of lines that is not demarcated. Thus, the keys can correspond to the demarcations of the captions into a number of lines. Likewise, the user can enter the timestamps of the video 108 to which these lines correspond, to indicate when these lines are to be displayed. Finally, the user may enter one or more characters of the lines, or delete or modify characters of the lines of the captions that may be preexisting due to the earlier generation by the voice-recognition mechanism 202.
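
The recording of part 312 might, for instance, be kept as a simple event log; the EditEvent structure and its fields below are hypothetical, chosen only to make the keys/timestamps/characters distinction concrete.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class EditEvent:
    """One user action recorded by the editor-type subsystem in part 312."""
    kind: Literal["key", "timestamp", "character"]
    value: str            # e.g., a line-break key, or a character typed
    video_seconds: float  # position in the video 108 the event refers to

log: list[EditEvent] = []
log.append(EditEvent("key", "\n", 13.4))          # user demarcates a new line
log.append(EditEvent("timestamp", "13.4", 13.4))  # line is keyed to the video
log.append(EditEvent("character", "O", 13.5))     # user corrects a character
```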

Thereafter, the particular-matching subsystem 210 is entered (314). That is, the captions for the section of the video 108 in question, as modified by the recorded information in part 312, can be said to be transmitted to the subsystem 210. The functionality performed by the subsystem 210 is as has been described, and as is described in more detail later in the detailed description. Thereafter, the editor-type subsystem 102 is reentered, such that it can be said that the captions for the section of the video 108 in question, as may have been modified by the subsystem 210, are transmitted back to the editor-type subsystem 102.

If the particular-matching subsystem 210 generated any predicted character strings as part of the captions (316), then such predicted character strings are presented to the user within the editor-type subsystem 102 (318). In either case, thereafter, if a new line within the captions has been (temporarily) determined by the particular-matching subsystem 210 (320), then the line-based subsystem 104 is entered (308). That is, if the particular-matching subsystem 210 has itself divided the captions into one or more new lines, then it is now appropriate for the line-based subsystem 104 to perform processing. These new lines are temporary lines, since the line-based subsystem 104 may modify them further, as is conventional. In addition, entry of the line-based subsystem 104 can be considered transmission of the captions to the subsystem 104.

If there are no new lines of the captions for the section of the video 108 in question temporarily determined by the particular-matching subsystem 210 (320), then the method 300 is finished (310). Likewise, once processing by the line-based subsystem 104 is finished (308), the method 300 is finished (310). Processing by the line-based subsystem 104 may be accomplished as is conventional, as can be appreciated by those of ordinary skill within the art, where a summary of such functionality has been described earlier in the detailed description and in the background. It is noted that the method 300 may be repeated for each section of the video 108, until all the sections of the video 108 have been processed insofar as captioning is concerned.
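
A minimal sketch of the overall control flow of the method 300 follows, with every subsystem call reduced to a stub callable; all of the names below are hypothetical and stand in for parts 302 through 320.

```python
from dataclasses import dataclass, field

@dataclass
class MatchResult:
    predicted_strings: list = field(default_factory=list)
    new_lines: list = field(default_factory=list)

def method_300(captions, terminate_editor, general_match, particular_match,
               record_edits, present, line_based_edit):
    """Sketch of the FIG. 3 flow for one section of the video."""
    if terminate_editor:                                  # parts 302/304
        return line_based_edit(general_match(captions))   # parts 306/308
    record_edits(captions)                # part 312: keys/timestamps/chars
    result = particular_match(captions)   # part 314
    if result.predicted_strings:          # parts 316/318
        present(result.predicted_strings)
    if result.new_lines:                  # parts 320/308
        return line_based_edit(result.new_lines)
    return captions                       # part 310: finished

# Example run with trivial stubs in place of the subsystems.
out = method_300(["OVER THE BAG."], True,
                 general_match=lambda c: c,
                 particular_match=lambda c: MatchResult(),
                 record_edits=print, present=print,
                 line_based_edit=lambda c: c)
print(out)
```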

FIG. 4 shows a method 400 that can be performed by the general-matching subsystem 208 of FIG. 2 when it is entered to perform functionality on the captions for a given section of the video 108, according to an embodiment of the invention. First, character-based line matching is performed (402). Matching in this respect means matching the captions with the video 108, so that the captions can be correctly aligned with the video 108, such that when the video 108 is played back, the captions are displayed at the correct time. Character-based line matching means that lines of the captions are matched to spoken utterances within the video 108 based on the characters of the lines themselves.

The manner by which matching in general is performed is not limited by embodiments of the invention. Any particular approach or technique that yields satisfactory results can be used. In one embodiment, dynamic-programming (DP) models and techniques can be employed, as understood by those of ordinary skill within the art. In the reinforcement-learning literature, DP refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment, such as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning, both because of their assumption of a perfect model and because of their great computational expense, but they are still very important theoretically. Thus, modified DP algorithms can instead be employed which do not require the assumptions and rigor of classical DP algorithms; in the present context, DP serves to align the characters of the caption lines with the recognized speech.
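
As one concrete possibility (an assumption of this sketch, not a requirement of the specification), the character-based line matching of part 402 can be approximated with Python's standard difflib module, whose matcher is related to, though not identical with, classic DP alignment.

```python
import difflib

def match_lines_to_transcript(caption_lines, transcript, min_score=0.6):
    """Character-based line matching (part 402), sketched with difflib.

    For each caption line, find its longest contiguous match inside the
    recognized transcript; lines scoring below min_score are left
    unmatched, to fall through to phoneme-based matching (part 406).
    """
    results = []
    for line in caption_lines:
        needle = line.lower()
        matcher = difflib.SequenceMatcher(None, transcript, needle)
        m = matcher.find_longest_match(0, len(transcript), 0, len(needle))
        score = m.size / max(len(needle), 1)
        # m.a is the character offset of the match within the transcript.
        results.append((line, m.a if score >= min_score else None))
    return results

transcript = "the cat jumped over the bag"
print(match_lines_to_transcript(["THE CAT JUMPED", "OVER THE BAG."], transcript))
```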

If all the captions are not successfully matched via character-based line matching (404), then phoneme-based character matching is performed on the remaining captions (406). Phoneme-based character matching uses phonemes in addition to the individual characters. A DP approach may also be used with phoneme-based character matching. Again, matching in this respect means matching the captions in question to the portion of the video 108 to which they correspond, so that the captions are properly displayed on the video 108 as the video 108 is played back.

Next, a previously determined timestamp matching is received (408). This timestamp matching is a previously determined temporal positioning of the captions in relation to the spoken utterances of the current section of the video 108. This timestamp matching may have been accomplished by the voice-recognition mechanism 202 and stored in the storage device 204, such that it can be said that the general-matching subsystem 208 retrieves this matching from the storage device 204. It is noted that the general-matching subsystem 208 is entered before the line-based subsystem 104 is entered, such that it can be presumed that such timestamping has been previously performed; otherwise, the particular-matching subsystem 210 and the editor-type subsystem 102 would have been entered to process the captions in question.

Because the timestamping achieved by the voice-recognition mechanism 202, for instance, may vary from the matching that has been achieved on the basis of characters and/or phonemes, there may be discrepancies between the two that need rectification. If there are any so-called corrections to be made to the timestamp matching (410)—where the corrections result from the character-based or phoneme-based matching—then these corrections are returned (412) for later rectification by, for instance, the line-based subsystem 104. Ultimately, then, the method 400 is finished (414).
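
The rectification of part 410 can be pictured as a comparison of the recognizer's timestamps against those implied by the character- or phoneme-based matching; the half-second tolerance below is an invented example value, not one taken from the specification.

```python
def timestamp_corrections(recognizer_stamps, matched_stamps, tolerance=0.5):
    """Return (line_index, corrected_seconds) for every line whose
    recognizer timestamp disagrees with the matching-derived one by
    more than `tolerance` seconds (parts 410/412)."""
    corrections = []
    for i, (old, new) in enumerate(zip(recognizer_stamps, matched_stamps)):
        if abs(old - new) > tolerance:
            corrections.append((i, new))
    return corrections

print(timestamp_corrections([12.0, 13.4, 15.0], [12.1, 14.2, 15.0]))
# -> [(1, 14.2)]: only the second line needs rectification
```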

FIG. 5 shows a method 500 that can be performed by the particular-matching subsystem 210 of FIG. 2 when it is entered to perform functionality on the captions for a given section of the video 108, according to an embodiment of the invention. First, the voice-recognition rate of the captions initially generated, such as by the voice-recognition mechanism 202 of FIG. 2, is determined (502). Such determination may be accomplished by looking up this rate within a storage device, like the storage device 204 of FIG. 2. Alternatively, a user may be presented with a sample of the captions, and asked to listen to the original video 108 to determine its accuracy.

If this rate is not greater than a predetermined threshold, such as 75% accuracy (504), then the following is performed to, in effect, redo the caption-line matching that was achieved via the initial voice recognition. In particular, phoneme-based character matching is performed (506), where such phoneme-based character matching may be accomplished via a DP algorithm, as has been described. If the result of such matching is an accuracy rate that is still not greater than the predetermined threshold, such as 75% accuracy (508), then the method 500 returns a “no matching” error (512), and is finished (514). That is, if phoneme-based character matching still cannot improve the accuracy or certainty rate to greater than the threshold, then an indication is returned that matching captions to the video 108 was not able to be achieved. As before, a user may be requested to verify that the phoneme-based character matching was accurate in order to determine the accuracy rate of such matching, or another approach may be employed.

However, if the initial voice recognition yielded matching greater than the threshold in part 504, or the subsequently performed phoneme-based character matching yielded accuracy greater than the threshold in part 508, then the predicted character strings of such captions for the section of the video 108 in question are returned (510), and the method 500 is finished (514). Such returning of the predicted character strings is thus for later transmission to, for instance, the editor-type subsystem 102 in one embodiment. The character strings may also be referred to as one or more temporarily matched lines. That is, the character strings represent the matching of the captions to the section of the video 108 in question, with respect to temporal positioning or timestamping thereof.
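
Putting parts 502 through 514 together, the behavior of the particular-matching subsystem 210 might be sketched as follows; phoneme_match is a stub standing in for the DP-based phoneme matching, and the 0.75 threshold mirrors the example given in the text.

```python
def method_500(captions, recognition_rate, phoneme_match, threshold=0.75):
    """Sketch of FIG. 5: return predicted character strings, or raise
    an error if no sufficiently accurate matching can be achieved."""
    if recognition_rate > threshold:           # parts 502/504
        return captions                        # part 510: predicted strings
    rematched, rate = phoneme_match(captions)  # part 506: DP phoneme matching
    if rate > threshold:                       # part 508
        return rematched                       # part 510
    raise ValueError("no matching")            # parts 512/514

# Example with a stub phoneme matcher that reports 80% accuracy.
print(method_500(["OVER THE BAG."], 0.6, lambda c: (c, 0.8)))
```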

Finally, FIG. 6 shows a method 600 for performing general timestamping, according to an embodiment of the invention. The timestamping achieved by the method 600 is essentially conventional timestamping, but is described herein for completeness. The timestamping of the method 600 may be that which is ultimately received by the method 400 in part 408 of FIG. 4, for instance.

First, the method 600 determines whether all the caption lines of the captions initially determined or generated for the current section of the video 108 in question have been matched to the video 108 (602). Where this is the case, the method 600 proceeds to part 610, as will be described. Where this is not the case, however, offset phoneme matching is performed (604) to attempt to yield such matching. Offset phoneme matching is a particular type of phoneme matching, as understood by those of ordinary skill within the art, and may be performed by utilizing a DP algorithm, as has been described. If offset phoneme matching yields matching of all the caption lines to the video 108 (606), then the method 600 proceeds to part 610, as will be described. Otherwise, offset phonemes are allocated to achieve a rudimentary matching of the captions to the video 108 (608), as understood by those of ordinary skill within the art.

Therefore, ultimately, in some manner, all the captions for the current section of the video 108 in question have been matched to the video 108, in that they have been temporally synchronized with the video 108. As a result, the next step is to actually generate the timestamps corresponding to these temporal synchronizations. This process starts with the beginning of the captions for the current section of the video 108 in question. In particular, the next punctuation, word, or clause within the captions is detected, or advanced to (610), where in this particular instance this is the first punctuation, word, or clause.

If all the captions have been processed as a result of such detection or advancing (612), then ultimately the method 600 is finished (620). However, where there is still a portion of the captions that has not yet been so processed, then the method 600 continues by determining whether a line has been exceeded or divided (614). That is, if detection of the next punctuation, word, or clause results in advancement from one line to another line, then the test in part 614 is true. For instance, one line may be “THE CAT JUMPED” and the next line may be “OVER THE BAG.” When proceeding from the word “JUMPED” to the word “OVER,” such that the word that was most recently detected or advanced to in part 610 is the word “OVER,” the former line is advanced from and the latter line is advanced to, such that the test in part 614 is true.

In this case, the timestamp of the very next character within the captions for the section of the video 108 in question is determined (616) and returned (618) as corresponding to the line that has been advanced to. For instance, in the example of the previous paragraph, the timestamp of the space character following the word “OVER” is returned as the timestamp corresponding to the line “OVER THE BAG.” Following part 618, or where the test of part 614 is false or negative, the method 600 is repeated starting at part 610.
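
The walk of parts 610 through 618 can be illustrated as follows; the (word, line_number, after_seconds) token structure is an assumption of the sketch, since the specification does not fix any particular representation.

```python
def line_timestamps(words):
    """Sketch of FIG. 6, parts 610-618.

    `words` is a list of (word, line_number, after_seconds) triples,
    where after_seconds is the timestamp of the character immediately
    following the word. Whenever advancing to a word crosses into a
    new line (part 614), the timestamp of the very next character is
    recorded for the line advanced to (parts 616/618).
    """
    stamps = {}
    previous_line = None
    for word, line_number, after_seconds in words:   # part 610: advance
        if previous_line is not None and line_number != previous_line:
            stamps[line_number] = after_seconds      # parts 616/618
        previous_line = line_number
    return stamps                                    # part 620: finished

words = [("THE", 0, 10.3), ("CAT", 0, 10.7), ("JUMPED", 0, 11.4),
         ("OVER", 1, 11.9), ("THE", 1, 12.2), ("BAG.", 1, 12.8)]
print(line_timestamps(words))   # {1: 11.9}: the space after "OVER"
```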

It is noted that, although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is thus intended to cover any adaptations or variations of embodiments of the present invention. Therefore, it is manifestly intended that this invention be limited only by the claims and equivalents thereof.

1. A hybrid-captioning system to edit captions for spoken utterances within video comprising: an editor-type caption-editing subsystem in which captions are edited for spoken utterances within the video on a groups-of-lines basis without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances; a line-based caption-editing subsystem in which captions are edited for spoken utterances within the video on a line-by-line basis with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances; and, a mechanism to, for each section of spoken utterances within the video, select the editor-type caption-editing subsystem or the line-based caption-editing subsystem to provide captions for the section of spoken utterances in accordance with a predetermined criterion.
2. The hybrid-captioning system of claim 1, wherein the mechanism, for each section of spoken utterances within the video, selects the editor-type caption-editing subsystem or the line-based caption-editing subsystem to provide the captions for the section of spoken utterances based on a certainty level of voice recognition as to the section of spoken utterances.

3. The hybrid-captioning system of claim 2, wherein, for each section of spoken utterances within the video, where the certainty level of voice recognition as to the section of spoken utterances is greater than a predetermined threshold, the mechanism selects the line-based caption-editing subsystem to provide the captions for the section of spoken utterances, and otherwise selects the editor-type caption-editing subsystem to provide the captions for the section of spoken utterances.

4. The hybrid-captioning system of claim 1, wherein the mechanism, for each section of spoken utterances within the video, selects the editor-type caption-editing subsystem or the line-based caption-editing subsystem to provide the captions for the section of spoken utterances based on voice recognition results on a sample of the spoken utterances within the section of spoken utterances.

5. The hybrid-captioning system of claim 1, wherein the mechanism selects the line-based caption-editing subsystem for the sections of spoken utterances in which voice recognition results have a relatively high level of accuracy and selects the editor-type caption-editing subsystem for the sections of spoken utterances in which voice recognition results have a relatively low level of accuracy.

6. The hybrid-captioning system of claim 1, wherein the video is real-time video, such that the captions are edited for the real-time video.

7. The hybrid-captioning system of claim 1, wherein the video is recorded video, such that the captions are edited for the recorded video.

8. The hybrid-captioning system of claim 1, wherein generation of the captions is independent of an input path selected from the group of input paths essentially consisting of: a microphone path, and a file path.
9. A method comprising: in relation to video for which captions are to be edited, receiving user input as to a current section of the video for which captions have been generated, within an editor-type caption-editing subsystem in which captions are edited for spoken utterances within the video on a groups-of-lines basis without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances; where the user input corresponds to termination of the editor-type caption-editing subsystem, transmitting the captions generated for the current section to a general-matching subsystem; the general-matching subsystem transmitting the captions generated for the current section to a line-based caption-editing subsystem in which captions are edited for spoken utterances within the video on a line-by-line basis with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances; otherwise, transmitting the captions generated for the current section to a particular-matching subsystem; and, the particular-matching subsystem transmitting the captions generated for the current section back to the editor-type caption-editing subsystem.

10. The method of claim 9, further comprising, where temporal positioning of the captions has been determined as to the current section of the video, transmitting the captions generated for the current section to the line-based caption-editing subsystem from the editor-type caption-editing subsystem.

11. The method of claim 9, further comprising, where the captions generated for the current section comprise one or more predicted character strings, presenting the predicted character strings to the user.

12. The method of claim 9, wherein the particular-matching subsystem is to match spoken utterances of the current section of the video to the captions that have been generated for the current section, where the current section of the video corresponds to a group of lines of captions without demarcation of the lines.

13. The method of claim 9, wherein the general-matching subsystem is to match spoken utterances of the current section of the video to the captions that have been generated for the current section, where the current section of the video corresponds to one or more individually demarcated lines of captions.

14. The method of claim 9, further comprising, after the captions generated for the current section have been transmitted to the general-matching subsystem, the general-matching subsystem performing: performing character-based line matching as to the captions that have been generated for the current section to demarcate one or more lines of the captions; where not all of the captions have been matched via character-based line matching, performing phoneme-based character matching as to the captions that have been generated for the current section; and, receiving a previously determined temporal positioning of the captions in relation to spoken utterances of the current section of the video.

15. The method of claim 9, further comprising, after the captions generated for the current section have been transmitted to the particular-matching subsystem, the particular-matching subsystem performing: determining a voice-recognition rate of the captions that have been generated for the current section of the video; where the voice-recognition rate is greater than a threshold, returning the captions as one or more predicted character strings; otherwise, where the voice-recognition rate is not greater than the threshold, performing phoneme-based character matching as to the captions that have been generated for the current section; where the phoneme-based character matching results in a match value greater than a predetermined threshold, returning the captions as the one or more predicted character strings; and, otherwise, where the match value is not greater than the predetermined threshold, returning indication that no matching has occurred.

16. The method of claim 9, further comprising repeating the method for each of a plurality of other sections of the video.

17. The method of claim 9, wherein the video is real-time video, such that the captions are edited for the real-time video.

18. The method of claim 9, wherein the video is recorded video, such that the captions are edited for the recorded video.
19. An article of manufacture comprising: a tangible recordable data storage medium; and, means in the medium for selecting an editor-type caption-editing subsystem or a line-based caption-editing subsystem to provide captions for each of a plurality of sections of spoken utterances of video, in accordance with a predetermined criterion, wherein the editor-type caption-editing subsystem is that in which captions are edited for spoken utterances within the video on a groups-of-lines basis without respect to particular lines of the captions and without respect to temporal positioning of the captions in relation to the spoken utterances, and wherein the line-based caption-editing subsystem is that in which captions are edited for spoken utterances within the video on a line-by-line basis with respect to particular lines of the captions and with respect to temporal positioning of the captions in relation to the spoken utterances.

20. The article of manufacture of claim 19, wherein the predetermined criterion comprises a certainty level of voice recognition as to each section of spoken utterances within the video being greater than a predetermined threshold.